Detection of live breast cancer cells in bright-field microscopy images containing white blood cells by image analysis and deep learning

Abstract. Significance Circulating tumor cells (CTCs) are important biomarkers for cancer management. Isolated CTCs from blood are stained to detect and enumerate CTCs. However, the staining process is laborious and moreover makes CTCs unsuitable for drug testing and molecular characterization. Aim The goal is to develop and test deep learning (DL) approaches to detect unstained breast cancer cells in bright-field microscopy images that contain white blood cells (WBCs). Approach We tested two convolutional neural network (CNN) approaches. The first approach allows investigation of the prominent features extracted by CNN to discriminate in vitro cancer cells from WBCs. The second approach is based on faster region-based convolutional neural network (Faster R-CNN). Results Both approaches detected cancer cells with higher than 95% sensitivity and 99% specificity with the Faster R-CNN being more efficient and suitable for deployment presenting an improvement of 4% in sensitivity. The distinctive feature that CNN uses for discrimination is cell size, however, in the absence of size difference, the CNN was found to be capable of learning other features. The Faster R-CNN was found to be robust with respect to intensity and contrast image transformations. Conclusions CNN-based DL approaches could be potentially applied to detect patient-derived CTCs from images of blood samples.

these rare cells. Current CTC isolation techniques may be classified into two broad categories: affinity-based (labeled) and label-free techniques. Technologies based on the former approach rely on the interaction of cell-surface receptors on CTCs and specific antibodies, leading to the isolation of only the CTCs that bind to these antibodies. [10][11][12][13] In contrast, label-free approaches isolate CTCs based on physical characteristics, e.g., size, [14][15][16] deformability, 17 dielectric properties, 18,19 and density. 20 Despite the development of numerous separation methods for isolating CTCs from whole blood, the enriched CTC samples are almost always accompanied by a large number of white blood cells (WBCs). [21][22][23][24][25][26] The presence of these background nucleated cells makes it necessary to use molecular markers and immunostaining to distinguish CTCs from WBCs. However, the immunostaining process kills CTCs; thus, making it impossible to use them in downstream assays such as expansion studies or single-cell transcriptomics, all of which require the CTCs to be alive and functional. [27][28][29] Thus, there is a need for staining-free approaches that can enumerate live CTCs in a background of WBCs.
Current approaches for detecting stained CTCs range from manual scoring to techniques based on image processing and machine learning. There is a growing interest in using machine learning to detect and classify cells in patient blood samples as it eliminates drawbacks of manual scoring and threshold-based object detection. [30][31][32] For example, Zeune et al. 33 recently developed a deep learning (DL) network to automatically identify and classify fluorescently stained CTCs obtained from the blood of metastatic cancer patients with a reported accuracy of over 96%. DL networks are attractive because they do not require any feature engineering: they can automatically discover the set of features that best discriminates the cells of interest and can identify those cells directly from raw input images. 34 Given the success of DL models in identifying CTCs in stained images, new efforts are emerging to implement similar approaches in unstained samples. However, since CTCs are quite heterogeneous and are present in low counts in patient samples, generating ground truth data is challenging. Thus, as a first step, DL-based frameworks are being pursued in which in vitro cancer cells are mixed with blood cells, and the efficacy of DL models in detecting live cancer cells in a background of blood cells is being investigated. [35][36][37][38] As shown in Table 1, only a limited number of studies have been conducted on DL-based frameworks for cancer cell identification from a background of blood cells, with images acquired using quantitative phase, dark-field, and bright-field microscopy. A notable example is the study by Wang et al., 35 who demonstrated detection of live CTCs in the blood of renal cancer patients as well as in vitro human colorectal cancer cells (HCT-116) using a deep convolutional neural network (CNN) and reported an accuracy of 88.6%.
In this study, we use CNN approaches to demonstrate an automatic detection model for Michigan Cancer Foundation-7 (MCF-7) breast cancer cells in bright-field images of mixed population samples containing MCF-7 cells and WBCs. We develop two CNN models. (i) The first approach, referred to as "decoupled cell detection," allows us to investigate the prominent features extracted by the CNN to discriminate cancer cells from WBCs. To the best of our knowledge, the features a CNN uses to discriminate cells have not been explained by previous studies. (ii) The second approach is based on the faster region-based convolutional neural network, 39 referred to as Faster R-CNN, which is more efficient and therefore better suited for deployment. In addition, we introduce a novel automatic technique to generate the training set for the Faster R-CNN approach. This algorithm employs the bright-field image and the associated fluorescent images as ground truth. We also investigate the effect of image transformation on the classification accuracy to assess how well the DL model performs when variability is introduced in image quality. Finally, we discuss avenues for moving beyond the model system studied here and the challenges that need to be addressed to apply DL techniques to live CTC detection in patient blood samples.

Sample Preparation, Image Acquisition, and Data Sets
In this section, we discuss the methods for culturing cancer cells, isolating WBCs, and how we generated the image data sets of pure cell populations and mixed cell populations. We also clearly outline the image data sets used in developing our machine learning methods.

MCF-7 Cell Staining
The cells were labeled with CellTracker™ Green CMFDA (Invitrogen™, Waltham, Massachusetts). The dye was prepared according to the manufacturer's protocol to make a stock solution of 10 mM. A working solution of 1 μM was made by diluting the stock solution in serum-free media. This working solution was added to the tissue culture flask containing cells, followed by incubation for 45 min at 37°C. After incubation, the cells were washed three times with phosphate buffer saline (Gibco, Gaithersburg, Maryland) to remove any excess dye. The cells were then trypsinized, resuspended in fresh media, and stained with the nuclear stain 4′,6-diamidino-2-phenylindole (DAPI, Molecular Probes, Eugene, Oregon). A working concentration of 100,000 cells/ml was used for the experiments.

WBC Isolation and Staining
Normal human whole blood was ordered from BioIVT (Hicksville, New York). To isolate WBCs, 1 ml of whole blood was lysed using ammonium-chloride-potassium (ACK) lysing buffer (Quality Biological, Gaithersburg, Maryland). WBCs were stained with the nuclear stain DAPI. A working concentration of 3.6 × 10^6 cells/ml was used for the experiments.

Pure Cell Population Images
Slides of MCF-7 cells were prepared by adding 10 μl of working solution of MCF-7 cells on a cover glass (Richard-Allan Scientific, Kalamazoo, Michigan) equipped with 10 mm × 24 mm spacers on the edges. This solution was then sandwiched using another cover glass and imaged using an Olympus IX81 microscope (Massachusetts). The microscope is equipped with a Thorlabs automated stage (New Jersey) and a Hamamatsu digital camera (ImagEM X2 EM-CCD, New Jersey) controlled by SlideBook 6.1 (3i Intelligent Imaging Innovations Inc., Denver). Bright-field and fluorescence images were acquired using a 20× objective (0.8 μm/pixel, 512 × 512 pixels) with the fluorescent filters DAPI and fluorescein isothiocyanate (FITC). Exposure times between 30 and 200 ms were used for image acquisition. Similarly, slides of WBCs were prepared by adding 10 μl of working solution of WBCs on the sandwiched cover glass equipped with spacers. Bright-field and fluorescence images were acquired using a 20× objective (0.8 μm/pixel, 512 × 512 pixels) and the fluorescent filter DAPI. Exposure times used for the WBC pure cell training set image acquisition were 30 to 100 ms.

Mixed Cell Population Images
Slides for imaging of mixed cell populations were prepared by mixing 5 μl of WBC working solution with 5 μl of MCF-7 working solution. This 10 μl solution was added to the cover glass and imaged using the same sandwich method as described previously. 40,41 Bright-field and fluorescent (DAPI, FITC) images were acquired at 20× magnification (0.8 μm/pixel, 512 × 512 pixels) with exposure times between 30 and 200 ms. The cell body and nucleus of MCF-7 cells were fluorescently labeled using CellTracker™ Green CMFDA and DAPI, respectively, whereas WBCs were labeled only with DAPI, enabling distinction of the two cell types under fluorescence imaging.

Image Data Sets
The bright-field images of MCF-7s and WBCs were accompanied by ground truth DAPI and FITC fluorescent images for exploration, training, and evaluation of DL models used in this study. We acquired 22 pure MCF-7 cell images and 21 pure WBC images totaling 1700 MCF-7 cells and 11,518 WBCs. For the mixed cell population, we acquired 592 images that contained 4722 MCF-7 cells and 12,314 WBC cells. Figures 1(a) and 1(b) show examples of pure cell and mixed cell bright-field images along with their associated fluorescent counterparts. Figure 2 shows the details and statistics of the image sets that are used in the following sections.

Results and Discussion
In this study, we investigated two CNN-based approaches to detect MCF-7 cancer cells from a background of WBCs. In the first approach, referred to as decoupled cell detection, all cells are localized using the maximally stable extremal regions (MSER) algorithm, 42 and then classified using a trained CNN, i.e., the cell localization and identification tasks are carried out in two distinct steps. The benefit of the decoupled approach is that it allows investigation of the distinctive features that convolutional layers extract to distinguish MCF-7 cells and WBCs.  Additionally, in the decoupled approach, one can differentiate the localization and classification error rates.
In the second approach, we detect MCF-7 cells using a Faster R-CNN model. In this approach, the cell localization and identification tasks are integrated without having separate access to the output of the localization step or the input of the classification step. As a result, it is difficult with Faster R-CNN model to understand the features being used to discriminate cells. However, due to its speed and efficiency of execution, the Faster R-CNN approach is better suited for deployment.
Below, we discuss the results from our investigations with these two approaches to detect MCF-7 cells in a background of WBCs. We note that the network performance is reported in terms of accuracy, sensitivity, and precision. 43 Here, true positives (TPs) were defined as the number of detected MCF-7 cells. False negatives (FNs) were defined as the number of MCF-7 cells missed in the detection procedure, and false positives (FPs) were defined as the number of detections that did not correspond to MCF-7 cells.
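As a minimal sketch of how the detection counts defined above map to the reported metrics, the snippet below computes sensitivity and precision from TP, FN, and FP. The counts are hypothetical and chosen only to illustrate the arithmetic; they are not values from this study.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of MCF-7 cells that were detected: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of detections that are MCF-7 cells: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts, for illustration only (not taken from Table 2).
tp, fn, fp = 950, 50, 10

sens = sensitivity(tp, fn)
prec = precision(tp, fp)
print(f"sensitivity = {sens:.3f}, precision = {prec:.3f}")
```

Neither metric uses true negatives, which is convenient in a detection setting where the number of background objects is not well defined.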

Decoupled Cell Detection
In this approach, the MCF-7 cells are detected in two decoupled modules: a cell localization module and a cell identification module. To detect the MCF-7 cells in a given bright-field image: (1) the bright-field image is fed to the localization module to localize the cells using the MSER algorithm, and (2) tiles of localized single cells are provided to the identification module, which distinguishes MCF-7 cells using a trained CNN. The architecture and training details of the CNN are discussed in the following sections. As noted earlier, one of the purposes of developing this framework was to study the distinctive features that convolutional layers extract to distinguish MCF-7 cells and WBCs. Therefore, it was essential to have access to the input of the CNN classifier so that its performance could be studied and evaluated in isolation.
To develop the decoupled cell detection approach, we first designed and tested a shallow CNN using tiles of individual MCF-7 cells and WBCs. Subsequently, we evaluated the prominent cell features that allow successful classification with this trained CNN. Next, we optimized the CNN architecture to improve the detection of MCF-7 cells. Finally, we applied the decoupled cell detection approach to mixed cell images, i.e., individual images that contained a mixture of MCF-7 cells and WBCs [see Fig. 2(b)]. The results from this systematic investigation are discussed below.

Shallow CNN efficiently discriminates MCF-7 cells from WBCs
To generate the training and testing set for the shallow CNN, the MSER algorithm was applied to localize the cells in the "pure" bright-field images in set I, i.e., images containing only MCF-7 cells or WBCs. For each localized cell, a 36 × 36 tile was cropped with the cell at the center. The tile size was determined based on the MCF-7 estimated cell size distribution [Fig. 1(c)] and image resolution. This process ensured that both WBCs and MCF-7 cells were contained entirely in the cropped tiles. The tiles were carefully inspected, verified, and labeled manually employing the FITC and DAPI masks. Those tiles with a complete WBC located in the center of the corresponding tile were labeled "WBC." Similarly, tiles with a complete MCF-7 cell positioned in the central region of the corresponding tile were labeled "MCF7." Set I-training set was formed by creating balanced classes of "WBC" and "MCF7" tiles, each class comprising 1190 tiles. Set I-testing set contained 510 tiles per class.
We initially designed and trained a shallow CNN with two convolutional layers followed by a fully connected layer to classify the tiles into WBC and MCF7 categories. The convolutional layers had 5 × 5 kernel sizes followed by a 2 × 2 pooling layer to minimize the encoding depth due to the relatively small size of the tiles. The trained CNN was found to achieve satisfactory performance with a training accuracy of 99.51% and a test accuracy of 99.54%. These performance metrics justified the use of the relatively shallow architecture of the CNN as it achieved a high accuracy level without overfitting. In addition, the class probability histograms showed the network's high confidence level of above 99% on classification decisions. We concluded that the trained CNN successfully performed the discrimination task.
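To illustrate why the small tile size limits how many convolution and pooling stages can be stacked, the sketch below tracks the feature-map side length through successive stages. It assumes unpadded ("valid") 5 × 5 convolutions with stride 1 and non-overlapping 2 × 2 pooling; the padding and stride are not stated in the text, so these are assumptions made only for illustration.

```python
from typing import List


def conv_pool_output(size: int, kernel: int = 5, pool: int = 2) -> int:
    """Side length after one unpadded stride-1 convolution and one pooling step."""
    after_conv = size - kernel + 1  # 'valid' convolution (assumed)
    return after_conv // pool       # non-overlapping pooling (assumed)


def feature_map_sizes(input_size: int, n_stages: int) -> List[int]:
    """Side lengths of the input and of the output of each conv + pool stage."""
    sizes = [input_size]
    for _ in range(n_stages):
        sizes.append(conv_pool_output(sizes[-1]))
    return sizes


two_stage = feature_map_sizes(36, 2)
three_stage = feature_map_sizes(36, 3)
print(two_stage)    # [36, 16, 6]
print(three_stage)  # [36, 16, 6, 1]
```

Under these assumptions a 36 × 36 tile shrinks to 6 × 6 after two stages and to a single value per channel after three, so there is little room for a deeper stack without padding.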

Identifying the discriminatory cellular features
We investigated the cellular features that enabled the successful classification by the shallow CNN. Since in Figs. 1(c) and 1(d) we observed that the mean diameter of MCF-7 cells is approximately twice as large as the mean diameter of WBCs, we hypothesized that cell size could be a distinguishing feature. To test this hypothesis, we made the size of WBCs similar to MCF-7 cells. This was achieved by doubling the size of set I-testing WBC tiles, which doubled the size of the WBCs contained in them, and then they were cropped to make them compatible with the network input. When we tested the trained CNN, i.e., the model trained on non-modified MCF7 and WBC tiles, on the newly generated WBC tiles, we observed a deterioration of 61% in the classification accuracy, which suggested that the trained CNN indeed relied on the size feature to make the classification decision. In other words, a logical explanation for why the network decided that resized WBCs look less like WBCs and more like MCF7s was because it learned to use size as the main feature of distinction.
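The resize-then-crop step used in this size experiment can be sketched on a toy tile as follows. Nearest-neighbour interpolation and a centred crop are assumptions made for illustration; the text does not state which interpolation was used.

```python
def upscale2x(tile):
    """Nearest-neighbour 2x upscaling of a 2-D list of pixel values (assumed method)."""
    out = []
    for row in tile:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out


def center_crop(tile, size):
    """Crop a size x size window from the centre of a 2-D list."""
    h, w = len(tile), len(tile[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in tile[top:top + size]]


# Toy 6 x 6 tile with a 2 x 2 "cell" in the centre (1 = cell, 0 = background).
tile = [[1 if 2 <= r <= 3 and 2 <= c <= 3 else 0 for c in range(6)]
        for r in range(6)]

resized = center_crop(upscale2x(tile), 6)
cell_area_before = sum(map(sum, tile))
cell_area_after = sum(map(sum, resized))
print(cell_area_before, cell_area_after)  # 4 16
```

The tile keeps its original dimensions, so it remains compatible with the network input, while the cell inside it doubles in diameter.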
Next, we hypothesized that in addition to cell size, the shallow CNN could extract other geometric and photometric features such as texture of the cell to perform the classification task successfully. To test this hypothesis, we eliminated the size difference between the cell types and trained our CNN in two ways. (i) We doubled the size of set I-training WBC tiles and used them together with set I-training MCF7 tiles to train the CNN. In doing so, we achieved a 99.33% training accuracy and a 98.14% testing accuracy. (ii) Alternatively, set I-training MCF7 tiles were halved in size and combined with set I-training WBC tiles to train the CNN. The new network designed to carry out this experiment processed 15 × 15 (not 36 × 36) input images. To meet this requirement, MCF7 tiles were simply resized by a factor of 0.5 and WBC tiles were cropped. This resulted in a 98.32% training and a 96.86% test accuracy. Note that in each case, set I-testing tiles were modified the same way as set I-training tiles.
The high training and testing accuracy suggest that although size is a prominent discriminatory feature used by the designed CNN, the network does utilize other features such as texture to perform cell classification with a high, albeit reduced, level of accuracy if necessary. This result helps explain the model's ability to distinguish the two cell types, even when there is an overlap in the cell size distributions [Figs. 1(c) and 1(d)].

Optimizing the CNN architecture
The training was carried out on two different CNN architectures, one with two and the other with three convolutional layers while keeping the kernel sizes as 5 × 5 and pooling layer as 2 × 2.
To boost the TP detection rate of the CNN, we tuned the MSER parameters to maximize the localization of MCF-7 cells at the expense of a proportional increase in the overall number of tiles output by the MSER module. The classification task was then recast as a two-class problem in which one class represented MCF-7 cells, and the second represented all non-MCF-7 objects, including WBCs and debris. The results showed that the CNN with three convolutional layers achieved a testing sensitivity and precision of 98.8%, outperforming the CNN with two layers by more than 3%. We therefore used the CNN with three convolutional layers in subsequent studies for classification in the decoupled cell detection approach.
So far, our investigations were focused on the pure cell images in set I to develop and optimize a shallow CNN and identify cellular features that enable classification. Next, we trained and evaluated the decoupled cell detection approach, as shown in Fig. 3, on images containing both MCF-7 cells and WBCs, i.e., set II images as described in Fig. 2(b). For training purposes, 60% of set II images were randomly chosen and used, which hereafter will be referred to as the "training images." The remaining 40% of set II images were used to evaluate the developed framework, i.e., the "testing images." In the cell localization module, i.e., module 1, the MSER algorithm is applied to the input image to localize the objects. The MSERs are those connected regions that do not change in size and morphology as one incrementally thresholds the image over a wide range of thresholding values. There are three parameters to be adjusted for the MSER algorithm: (1) size range, i.e., smallest and largest extremal regions to be selected, (2) stability testing range, i.e., minimum number of thresholding iterations that results in a stable extremal region, and (3) maximum variation, i.e., the maximum size variation that is allowed for the region to be counted as stable.
To detect the MSERs, the image is iteratively binarized at different thresholding levels and connected regions are estimated. The detector selects any connected region that is within the size range and exhibits less area variation than the maximum variation parameter within the specified stability testing range. In this framework, the size range was set based on the lower and upper bounds of the MCF-7 size distribution to maximize the localization of MCF-7 cells. Specifically, this parameter was set to [lower bound −5%, upper bound +5%]. Outputs from the localization module are 36 × 36 tiles centered on the corresponding objects. The tile size was calculated based on the upper bound of the MCF-7 cell size distribution and image resolution.
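The size-range setting and the tile cropping described above can be sketched as below. This is not an MSER implementation: it only shows the size filter and the tile geometry applied to regions that a detector has already produced. The region dictionaries, the pixel-area bounds, and the reading of "−5%" and "+5%" as fractions of the respective bound are all assumptions for illustration.

```python
def mser_size_range(lower, upper, margin=0.05):
    """Widen [lower, upper] by a relative margin (assumed reading of the +/-5% rule)."""
    return lower * (1 - margin), upper * (1 + margin)


def select_regions(regions, size_range):
    """Keep regions whose area falls inside the size range."""
    lo, hi = size_range
    return [r for r in regions if lo <= r["area"] <= hi]


def tile_box(cx, cy, tile=36):
    """(left, top, right, bottom) of a tile centred on (cx, cy); right/bottom exclusive."""
    half = tile // 2
    return cx - half, cy - half, cx + half, cy + half


# Hypothetical bounds (in pixels of area) and hypothetical candidate regions.
size_range = mser_size_range(100, 400)
regions = [
    {"area": 50, "cx": 30, "cy": 40},
    {"area": 200, "cx": 100, "cy": 50},
    {"area": 410, "cx": 300, "cy": 220},
    {"area": 500, "cx": 400, "cy": 90},
]
kept = select_regions(regions, size_range)
boxes = [tile_box(r["cx"], r["cy"]) for r in kept]
print(boxes)
```

In this toy input the region of area 410 is retained only because of the 5% margin on the upper bound, which is the intended effect of widening the range.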
In the cell identification module, i.e., Module 2, the cropped tiles are fed to the trained CNN model that, as described above, consists of three convolutional layers and a fully connected layer. The convolutional layers have a 5 × 5 kernel size, followed by a 2 × 2 pooling layer. The fully connected layer outputs a two-dimensional vector containing the MCF7 and "non-MCF7" class prediction scores.
The training and testing set label assignment workflow for developing the decoupled cell detection framework is shown in Fig. 4. In this workflow, first, MSER was applied to each input image, and tiles of localized objects were generated. Afterward, the cropped tiles were manually  categorized based on their corresponding signatures in DAPI and FITC masks. Every localization depicting a signature in both DAPI and FITC masks was labeled as MCF7 since cancer cells were live stained with both a nucleus (DAPI) and a cell body (FITC) marker, whereas tiles lacking any signature in the FITC mask were annotated as non-MCF7. The input to this workflow is the training or testing images, i.e., Fig. 2(b), and the generated training or testing set is a collection of cell tiles, each associated with a label.
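The labeling rule in this workflow can be written as a small decision function. The function name and the boolean inputs are hypothetical. A tile with a FITC signature but no DAPI signature is not covered by the rule as stated, so the sketch returns `None` for that case rather than guessing.

```python
from typing import Optional


def label_tile(has_dapi: bool, has_fitc: bool) -> Optional[str]:
    """Assign a class label to a tile from its fluorescence signatures."""
    if has_fitc and has_dapi:
        return "MCF7"      # nucleus (DAPI) and cell body (FITC) both present
    if not has_fitc:
        return "non-MCF7"  # WBCs and debris carry no FITC signal
    return None            # FITC without DAPI: not specified in the text


labels = [
    label_tile(has_dapi=True, has_fitc=True),
    label_tile(has_dapi=True, has_fitc=False),
    label_tile(has_dapi=False, has_fitc=False),
    label_tile(has_dapi=False, has_fitc=True),
]
print(labels)
```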
We used the training images to generate the training set containing 2760 MCF7 tiles and 2760 tiles in the non-MCF7 class. The CNN was trained using the ADAM optimizer for 35 epochs. The training process took 74 s on an Nvidia TITAN RTX. The trained classification CNN achieved 98.99% training sensitivity and 99.8% training precision.

Performance of the decoupled cell detection approach
We evaluated the decoupled approach on the generated testing set, i.e., a set of 1962 MCF7 tiles and 1962 non-MCF7 tiles, using three performance metrics: sensitivity, precision, and average execution time per image as shown in Table 2. Overall, the decoupled approach performed with 95.5% accuracy, 95.3% sensitivity, and 99.8% precision, which indicates its ability to successfully achieve the MCF-7 identification task. Figure 5 shows representative examples of TPs, FPs, and FNs observed during the analysis. In the decoupled approach, MCF-7 cells localized and correctly classified were counted toward TPs [Fig. 5(a)]. The combination of (i) the MCF-7 cells
We further investigated the source of FNs. As shown in Table 2, the MSER localization module is unable to localize 71 MCF-7 cells, whereas the identification module misclassifies 22 MCF-7 cells. Therefore, the localization errors are responsible for more than 75% of the FNs. The majority of FNs encountered during identification are MCF-7 cells whose contrast, shape, and size differ from an average MCF-7 cell [see examples in Fig. 5(e)]. These two observations indicate that the convolutional layers of the CNN classifier were able to grasp the common features of MCF-7 cells for identification. As observed, the decoupled nature of this approach enabled us to trace back the missed detections of the CNN classifier and study them to gain insights into the discriminatory power of features that convolution layers extract in classifying MCF-7 cells versus WBCs.
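The attribution of FNs to the two modules follows directly from the counts quoted above (71 cells missed by localization, 22 misclassified by the CNN). A short check of that arithmetic, with variable names chosen here for illustration:

```python
fn_localization = 71    # MCF-7 cells the MSER module failed to localize
fn_identification = 22  # localized MCF-7 cells the CNN misclassified

fn_total = fn_localization + fn_identification
localization_share = fn_localization / fn_total
print(f"{fn_total} FNs, {localization_share:.1%} due to localization")
```

The localization share comes out at roughly 76%, consistent with the statement that localization errors account for more than 75% of the FNs.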

Faster R-CNN Cell Detection
The decoupled cell detection approach performs the MCF-7 localization and identification in two steps. However, the sensitivity of this approach is limited by the large number of FNs caused by localization errors. We therefore explored a Faster R-CNN model, which integrates these two steps without providing separate access to the output of the localization step or the input of the classification step. The Faster R-CNN model takes a bright-field image as input and outputs a bounding box (BB) for every MCF-7 cell present in the input image (Fig. 6). Two major modules underlie the Faster R-CNN model: (1) a localization module, i.e., a region proposal network (RPN) that discovers regions in the input image that are likely to include an MCF-7 cell; and (2) a classification module, i.e., a Fast R-CNN that labels the region proposals that contain an MCF-7 cell and estimates a BB that properly confines the cell.

Automated dataset label assignment procedure
Fig. 6 The Faster R-CNN cell detection approach. The Faster R-CNN detection model consists of two main integrated modules: (i) an RPN that generates the region proposals (indicated by the orange bounding boxes in the image) that are most likely to include an MCF-7 cell, and (ii) a Fast R-CNN that labels the generated region proposals and outputs a BB for each detected MCF-7 cell. In the output image, the yellow boxes give the estimated bounding boxes assigned to the MCF-7 cell and the red boxes depict the corresponding ground truth.
For training or evaluating a Faster R-CNN model, every image must be accompanied by a BB associated with each MCF-7 cell present in the image. The manual generation of BBs can be a tedious, subjective, and labor-intensive task. Here, we propose a technique for generating the BBs automatically (see Fig. 7). In this technique, the MCF-7 cell signatures in the FITC mask are first localized utilizing the circular Hough transform algorithm. 44 The FITC mask background is less noisy and complex than its corresponding bright-field image and, therefore, a better option to use for localization. For every localization, if not touching the image boundaries, the corresponding region in the DAPI mask is screened. If a signature exists in the screened region, then the corresponding localization is added to the MCF-7 binary mask. Lastly, for every blob in the MCF-7 binary mask, a 36 × 36 BB is generated and centered on that blob.
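The screening and box-generation steps of this procedure can be sketched as follows, assuming the circular Hough transform has already returned cell centres from the FITC mask. The function name, the mask representation (a 2-D list of 0/1 values), and the use of the 36 × 36 box itself for both the boundary test and the DAPI screening region are simplifying assumptions, not details given in the text.

```python
def generate_boxes(fitc_centers, dapi_mask, box=36):
    """Return (left, top, width, height) boxes for centres that pass both checks."""
    h, w = len(dapi_mask), len(dapi_mask[0])
    half = box // 2
    boxes = []
    for cx, cy in fitc_centers:
        left, top = cx - half, cy - half
        right, bottom = left + box, top + box  # exclusive limits
        # Skip localizations whose box touches or crosses the image boundary.
        if left <= 0 or top <= 0 or right >= w or bottom >= h:
            continue
        # Screen the corresponding DAPI region for a nuclear signature.
        has_nucleus = any(dapi_mask[y][x]
                          for y in range(top, bottom)
                          for x in range(left, right))
        if has_nucleus:
            boxes.append((left, top, box, box))
    return boxes


# Toy 100 x 100 DAPI mask with a single nuclear pixel at (x=50, y=50).
dapi_mask = [[0] * 100 for _ in range(100)]
dapi_mask[50][50] = 1

# Hypothetical Hough centres: valid cell, near-boundary hit, hit with no nucleus.
centers = [(50, 50), (10, 50), (70, 20)]
bbs = generate_boxes(centers, dapi_mask)
print(bbs)  # [(32, 32, 36, 36)]
```

Because every box has the same size and is centred on its blob, the annotations are uniform across images, which is the consistency property the procedure is designed to provide.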
With this technique, 613 mixed population and pure cell images were annotated in less than 117 s, which is a significant improvement compared to manual annotation taking more than 100 s per image on average using MATLAB's Image Labeler tool. Additionally, this technique produced equal-sized BBs that are centered on the cells; an outcome that is difficult to achieve manually. Consequently, the proposed automatic dataset label assignment algorithm is a fast, efficient, and consistent technique to generate BB annotations for MCF-7 bright-field images.
For the Faster R-CNN cell detection approach, we utilized the same training images used in the decoupled cell detection approach. Using the dataset label assignment framework demonstrated in Fig. 7, we generated a training set comprising the 355 training images with a total of 2760 MCF-7 BBs. We employed the ResNet50 45
For the Faster R-CNN detection approach, the TPs, FNs, and FPs were defined in the same way as in the decoupled approach; however, they were interpreted differently because the two approaches operate differently. All MCF-7 cells assigned a BB having IoU > 50% with respect to the annotation BB were counted toward TPs [Fig. 8(a)]. An MCF-7 cell missing an estimated BB was considered an FN [Fig. 8(b)]. There were two kinds of FPs as shown in Figs. 8(c) and 8(d): (1) redundant BBs assigned to a single MCF-7 cell and (2) non-MCF7 objects assigned a BB.
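The IoU criterion used to count TPs can be illustrated with a short sketch for axis-aligned boxes. The `(x, y, width, height)` box format and the example offsets are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0


ground_truth = (100, 100, 36, 36)
close_box = (106, 100, 36, 36)  # shifted 6 pixels in x
far_box = (118, 100, 36, 36)    # shifted 18 pixels in x

iou_close = iou(ground_truth, close_box)
iou_far = iou(ground_truth, far_box)
print(f"{iou_close:.3f} {iou_far:.3f}")  # 0.714 0.333
```

With 36 × 36 boxes, a 6-pixel shift still satisfies IoU > 50%, whereas an 18-pixel shift (half the box width) falls to one third and would not be counted as a TP.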

Performance comparison of the two approaches
As discussed in the previous sections, we developed two different frameworks that detect MCF-7 cells in the mixed cell population images. The first developed approach, i.e., decoupled cell detection, performed with 95.3% sensitivity and 99.8% precision on testing images. To achieve a better sensitivity rate, we pursued and presented an alternative cell detection approach, i.e., Faster R-CNN detection approach. Table 2 gives the statistical measurements and corresponding performance metrics of the two cell detection approaches on the testing images. Both approaches exhibit a comparable precision, whereas the sensitivity of the Faster R-CNN approach is better than the decoupled approach, indicating that the Faster R-CNN approach is less likely to miss MCF-7 cell detections. It is worth noting that unlike the decoupled approach, we cannot trace back the missed detections (FNs) specifically to the localization or classification modules in the Faster R-CNN model.
Comparing the performance of both approaches indicates that the Faster R-CNN cell detection approach is preferred because it provides better overall performance in detecting MCF-7 cells in roughly half the time. Finally, the cell detection framework proposed by Wang et al. 35 is comparable to our decoupled cell detection approach in that it also comprises cell localization and cell identification modules. However, their work does not offer feature analysis, nor does it assess the performance of each module independently. Moreover, our proposed deployed detection model, i.e., the Faster R-CNN model, outperforms Wang et al.'s detection framework.

Impact of Image Intensity Transformations on the Faster R-CNN Approach
A major factor determining the effectiveness of a trained DL model is how well it performs when exposed to images with variations in the background intensity and contrast, as well as intensity variations within cells and cell boundaries. Such variations are not uncommon in experimental images due to variability in cell sample preparation and imaging conditions (e.g., focus position and exposure times). To understand the effect of such variations on the performance of Faster R-CNN model, we performed image transformations and assessed the performance of the Faster R-CNN approach on the transformed images. In addition to sensitivity and precision, F1-score was used to evaluate the overall performance of the Faster R-CNN detection model in the presence of image transformations. F1-score is defined as the harmonic average of recall and precision, evaluating the overall changes in the performance of the detection model, when deployed on a dataset. We performed image transformations and calculated the aggregate intensity distribution of a dataset by averaging the accumulated intensity distributions of all the images in the corresponding dataset. The intensity histogram of every image, $h(I)$, was altered in the following three ways:
1. Shift the image intensity histogram to generate a brighter version of the image. The offset ($B_1$) added to every pixel intensity is randomly drawn from a uniform distribution with bounds $[\mathrm{mean}(h(x))/4,\ \mathrm{mean}(h(x))/2]$. The altered image is modeled with $I_{\mathrm{Int}}(x, y) = I(x, y) + B_1$ with corresponding $h_{\mathrm{Int}}(I)$. This transformation increases the mean of $h(I)$ by 25% to 50%. The upper limit was set to avoid any pixel intensity saturation. The aggregate of all generated $h_{\mathrm{Int}}(I)$ is represented by $H_I$.
In addition to the above three transformations that were performed on the original image sets discussed in the previous sections, we also generated a new experimental image set by altering the focus and exposure time during image acquisition. The aggregate intensity distribution for this new data set is denoted by $H_{\mathrm{newexpt}}$.
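The brightness shift (the first transformation) can be sketched as below: one offset per image is drawn uniformly between a quarter and a half of the image's mean intensity and added to every pixel. The function name, the 2-D list image representation, the `max_value` clipping safeguard, and the fixed random seed are assumptions for illustration; the text instead relies on the choice of upper limit to avoid saturation.

```python
import random


def brighten(image, rng, max_value=255):
    """Add one random offset in [mean/4, mean/2] to every pixel of a 2-D list."""
    pixels = [v for row in image for v in row]
    mean = sum(pixels) / len(pixels)
    offset = rng.uniform(mean / 4, mean / 2)
    # Clipping is a safeguard added here; max_value is an assumed bit depth.
    shifted = [[min(v + offset, max_value) for v in row] for row in image]
    return shifted, offset


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[40, 60], [80, 100]]  # toy image with mean intensity 70

bright, offset = brighten(image, rng)
old_mean = 70.0
new_mean = sum(v for row in bright for v in row) / 4
print(f"offset = {offset:.2f}, mean ratio = {new_mean / old_mean:.3f}")
```

Because the offset lies between 25% and 50% of the mean, the mean of the shifted image is 1.25 to 1.5 times the original whenever no pixel is clipped.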
In Fig. 9(a), we show the aggregate intensity distributions $H_{TR}$ and $H_{TS}$ for the training and testing sets, respectively. As expected, the intensity distributions have similar mean and variance, indicating that the detection framework is trained on a distribution similar to that of the testing set. In contrast, as shown in Figs. 9(b), 9(c), and 9(e), $H_I$, $H_{IC}$, and $H_{\mathrm{newexpt}}$ have different mean values compared to $H_{TR}$. Additionally, Fig. 9(c) shows notable differences in the variance of $H_{IC}$ and $H_{TR}$. Note that Fig. 9(d) displays a mirroring effect between $H_{TR}$ and $H_N$. This property suggests that, in contrast to the original training images, the cells in the negative images appear brighter than the background. Table 3 compares the performance of the Faster R-CNN approach on the original testing set, the altered versions of the original testing set, and the new experimental dataset. As shown, the detection model achieves its best performance, i.e., the highest F1-score, on the original testing set. This observation is intuitive since the original testing dataset has a distribution similar to that of the original training set, and so the detection model is trained for the images in the original testing set. Taking the performance of the model on the testing set as the reference, for 25% to 50% intensity variation and 125% to 525% variance (or, equivalently, contrast) change, the F1-score decreased by <0.1% and 0.2%, respectively. These observations indicate that the Faster R-CNN approach is robust to intensity and contrast variations. Note that the degradation in the performance of the Faster R-CNN approach grows as the transformed intensity distribution deviates further from the intensity distribution of the original training set. Based on the preceding observations, it can be fairly concluded that the detection model extracts features that are relatively robust to intensity and contrast variations.
For the negative testing set, the precision noticeably decreased (−28.7%), causing the F1-score to drop by 16.8%, which shows that the model is considerably less robust to transformations that reverse the intensity map of the objects with respect to the background. However, when we retrained the Faster R-CNN model on training images and their negative versions and tested the retrained detection model on the negative of testing images, the performance recovered to 99.1% sensitivity, 97.3% precision, and F1-score of 98.2%. Thus, the detection model tends to show vulnerability toward contrast reversal transformations. However, if trained on both the image and its negative version, the detection model manages to learn features of the cell that are independent of the cell versus background contrast.
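Because the F1-score is the harmonic mean of precision and sensitivity, a drop in precision alone is enough to lower it. The short sketch below uses the standard F1 definition and the sensitivity and precision reported above for the retrained model; the function and variable names are illustrative.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)

# Values reported for the model retrained on images and their negatives.
f1_retrained = f1_score(precision=0.973, sensitivity=0.991)
```

Here `f1_retrained` evaluates to approximately 0.982, consistent with the reported F1-score of 98.2%.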

Application of Deep Learning Techniques to Live Patient-Derived CTC Detection
Our results show that the DL-based Faster R-CNN model can successfully classify live in vitro breast cancer cells in a background of WBCs. This outcome parallels the success that DL models are achieving in the fields of biomedical image analysis and clinical diagnostics. 46,47 These advances have led to successful efforts in applying DL to detect patient-derived CTCs in fluorescently stained blood samples. 33,35 Applying similar techniques for detecting live CTCs in unstained patient blood samples would open new opportunities (e.g., ex vivo expansion of CTCs for drug screening and molecular analysis). However, application of DL techniques for label-free detection of patient-derived CTCs presents challenges, as we discuss below. For supervised machine learning-based models, generation of valid ground truth data is essential. In the case of patient-derived CTCs, generation of ground truth images requires live-cell fluorescent markers that are specific to CTCs but do not stain blood cells. However, tagging CTCs with live-cell fluorescent labels has been a challenge in the field, although some progress is being made. For example, Wang et al. used a carbonic anhydrase IX PE-conjugated antibody along with a live-cell dye, calcein acetoxymethyl ester, to detect live CTCs in metastatic renal cell carcinoma patients. 35 In another study, a group of near-infrared (IR) heptamethine carbocyanine dyes were used to identify viable CTCs recovered from prostate cancer patients. 48 These dyes were shown to actively tag cancer cells in xenograft models and have since begun to be used as live markers to further improve cancer prognosis and treatment efficacy. 49 Recently, positive selection approaches involving the use of cell surface markers, such as HER2 (human epidermal growth factor receptor-2), EpCAM (epithelial cell adhesion molecule), and EGFR (epidermal growth factor receptor), have also been implemented to recover viable CTCs. 50
Another challenge for generating ground truth images is the heterogeneity of patient-derived CTCs. During the metastatic cascade, primary tumor cells that are of epithelial origin become migratory, acquiring a mesenchymal phenotype. 11,51,52 Studies with cancer patient blood show that some of the isolated CTCs have high epithelial marker expression, while others have high mesenchymal marker expression or a combination of both. [53][54][55][56] Because of this heterogeneity in marker expression, multiple fluorescent labels are required to tag live CTCs, capture the diversity present in patient blood, and generate ground truth data.
Finally, machine learning-based approaches often require large data sets for training. Since CTCs are present in extremely low counts in patient blood, generating large, annotated data sets requires access to many patients' blood samples. Collecting blood samples from a large patient population can be a time-consuming, expensive, and logistically challenging task. A potential avenue to address these challenges in label-free live detection of CTCs is to develop robust DL-based approaches to detect blood cells in patient samples and score the remaining non-blood cells as prospective CTCs. This negative selection-based approach would at least help to quickly identify which patient blood samples are of most interest for further scrutiny.
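The negative selection idea can be sketched as follows. This is a hypothetical illustration rather than an implemented pipeline: it assumes that an upstream detector has already assigned each cell a confidence of being a blood cell, and the `DetectedCell` type, the `blood_cell_score` field, and the 0.5 threshold are all assumptions introduced here.

```python
from dataclasses import dataclass

@dataclass
class DetectedCell:
    sample_id: str           # which patient sample the cell came from
    blood_cell_score: float  # hypothetical detector confidence that this is a blood cell

def prospective_ctcs(cells, blood_threshold=0.5):
    """Cells not confidently identified as blood cells are flagged as prospective CTCs."""
    return [c for c in cells if c.blood_cell_score < blood_threshold]

def rank_samples(cells, blood_threshold=0.5):
    """Order samples by their number of prospective CTCs, most interesting first."""
    counts = {}
    for c in cells:
        counts.setdefault(c.sample_id, 0)
        if c.blood_cell_score < blood_threshold:
            counts[c.sample_id] += 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

Ranking samples in this way only prioritizes them for further scrutiny. A flagged cell is a candidate, not a confirmed CTC.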

Conclusions
In this work, we proposed an automated framework for label-free detection of MCF-7 breast cancer cells in a background of WBCs in bright-field images. An effective Faster R-CNN-based detection model was developed for detecting MCF-7 cells in the acquired bright-field images. The proposed model demonstrated 99.1% sensitivity and 99.8% specificity, and an average IoU of >80%. The MCF-7 cell detection model analyzed each bright-field image in <0.3 s, more than 300× faster than a human labeler. Also, we introduced a novel fully automated technique for training set label assignment.
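For reference, the intersection over union (IoU) between a predicted and a ground truth bounding box can be computed as in the sketch below. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` tuples, which is a common convention but not necessarily the representation used in this work.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0
```

An average IoU above 80% means that, averaged over the detections, the overlap between a predicted box and its ground truth box exceeds 80% of the area of their union.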
Additionally, we conducted multiple studies to investigate the discriminatory features that an effective CNN derives and employs to differentiate MCF-7 cells and WBCs. These studies showed that the size of the cells was the main distinctive feature that the CNN used to distinguish MCF-7 cells from WBCs. However, in the absence of the size feature, the CNN was still capable of learning other features to perform the identification task, with an acceptable, yet decreased, accuracy level.
Finally, we examined the performance of the detection model in the presence of numerous image intensity transformations. The results demonstrated that for intensity and contrast variations, the F1-score of the detection model was reduced by <0.2%. Therefore, the MCF-7 cell detection model was sufficiently robust with respect to intensity and contrast image transformations. These observations indicate that the detection model uses intensity- and contrast-invariant features to perform the detection task.
The results in this work indicate that DL approaches could be potentially applied to detect CTCs in the blood of cancer patients. In the future, challenges related to live-cell and multimarker fluorescent labeling of patient-derived CTCs and community-wide access to large, labeled datasets need to be addressed.

Disclosures
No conflicts of interest, financial or otherwise, are declared by the authors.

Golnaz Moallem is a postdoctoral research fellow in the Personalized Integrative Medicine Laboratory, Department of Radiology, Stanford University. She received her BS degree in electrical engineering from Isfahan University of Technology, and her MS and PhD degrees in electrical engineering from Texas Tech University. Her research interests include machine learning and computer vision applications in clinical data analysis, particularly developing vision systems for medical image analysis, including image classification and object detection.
Adity A. Pore is a graduate research assistant in the Vanapalli Lab at Texas Tech University. She received her BS and MS degrees in chemical engineering from Gujarat Technological University and Texas Tech University, respectively. She is currently pursuing a PhD in chemical engineering. Her current project focuses on isolation and characterization of rare cancer-associated cells from the blood of breast cancer patients, which could be useful for predicting treatment outcomes.
Anirudh Gangadhar is currently a PhD candidate in chemical engineering at Texas Tech University. He previously received his MS degree in the same major from the University of Florida. His research interests include digital holographic microscopy, label-free screening of cancer cells in blood, and machine learning.
Hamed Sari-Sarraf received his PhD in electrical engineering from the University of Tennessee in 1993. He is currently a professor of electrical engineering at Texas Tech University. His area of interest is applied machine vision and learning.
Siva A. Vanapalli received his PhD in chemical engineering from the University of Michigan. He is currently a professor in chemical engineering and the Bryan Pearce Bagley Regents Chair in engineering at Texas Tech University. His research interests are in microfluidics, fluid dynamics, complex fluids, and machine learning, which are applied to biomedical applications in cancer and healthy aging.